{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Markdown Processing with fenic\n",
    "\n",
    "This example demonstrates how to use Fenic to process, analyze, and extract structured information from an academic paper written in Markdown format. \n",
    "\n",
    "The workflow covers the entire pipeline—from loading the Markdown document, generating a table of contents, and extracting document sections, to parsing and structuring references using both markdown-specific and JSON-based techniques.\n",
    "\n",
    "**Key steps include**:\n",
    "- Loading and casting the Markdown document for analysis.\n",
    "- Generating a table of contents and extracting document sections.\n",
    "- Filtering and parsing the References section to extract individual citations.\n",
    "- Using both text and JSON-based approaches to structure and analyze reference data.\n",
    "\n",
    "This notebook provides a practical example of how Fenic can be used to transform unstructured Markdown documents into structured, queryable data for further analysis."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setting Up the fenic Session\n",
    "\n",
    "This cell configures and initializes a Fenic session with semantic capabilities, enabling the use of a language model for advanced Markdown document analysis and extraction tasks."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pathlib import Path\n",
    "\n",
    "import fenic as fc\n",
    "\n",
    "config = fc.SessionConfig(\n",
    "    app_name=\"markdown_processing\",\n",
    "    semantic=fc.SemanticConfig(\n",
    "        language_models= {\n",
    "            \"mini\": fc.OpenAIModelConfig(\n",
    "                model_name=\"gpt-4o-mini\",\n",
    "                rpm=500,\n",
    "                tpm=200_000\n",
    "            )\n",
    "        }\n",
    "    )\n",
    ")\n",
    "\n",
    "# Initialize fenic session\n",
    "session = fc.Session.get_or_create(config)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Loading and Preparing the Markdown Document\n",
    "\n",
    "This cell loads the academic paper from a Markdown file, creates a fenic DataFrame containing the document, and casts the content to a Markdown-specific type to enable further analysis and extraction."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load the academic paper markdown content from file\n",
    "paper_path = Path(\"attention_is_all_you_need.md\")\n",
    "with open(paper_path, 'r', encoding='utf-8') as f:\n",
    "    paper_content = f.read()\n",
    "\n",
    "# Create DataFrame with the paper content as a single row\n",
    "df = session.create_dataframe({\n",
    "    \"paper_title\": [\"Attention Is All You Need\"],\n",
    "    \"content\": [paper_content]\n",
    "})\n",
    "\n",
    "# Cast content to MarkdownType to enable markdown-specific functions\n",
    "df = df.select(\n",
    "    fc.col(\"paper_title\"),\n",
    "    fc.col(\"content\").cast(fc.MarkdownType).alias(\"markdown\")\n",
    ")\n",
    "\n",
    "print(\"=== PAPER LOADED ===\")\n",
    "result = df.select(fc.col('paper_title')).to_polars()\n",
    "print(f\"Paper: {result['paper_title'][0]}\")\n",
    "print()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Generating a Table of Contents\n",
    "\n",
    "This cell uses fenic’s markdown functions to automatically generate a table of contents from the loaded academic paper, providing an overview of the document’s structure."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 1. Generate Table of Contents using markdown.generate_toc()\n",
    "toc_df = df.select(\n",
    "    fc.col(\"paper_title\"),\n",
    "    fc.markdown.generate_toc(fc.col(\"markdown\")).alias(\"toc\")\n",
    ")\n",
    "\n",
    "toc_df.show()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Extracting Document Sections\n",
    "\n",
    "This cell extracts all sections of the academic paper up to level 2 headers and converts them into a structured DataFrame. \n",
    "\n",
    "This enables further analysis and querying of individual document sections."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 2. Extract all document sections and convert to structured DataFrame\n",
    "sections_df = df.select(\n",
    "    fc.col(\"paper_title\"),\n",
    "    fc.markdown.generate_toc(fc.col(\"markdown\")).alias(\"toc\"),\n",
    "    # Extract sections up to level 2 headers, returning array of section objects\n",
    "    fc.markdown.extract_header_chunks(fc.col(\"markdown\"), header_level=2).alias(\"sections\")\n",
    ").explode(\"sections\").unnest(\"sections\")  # Convert array to rows and flatten struct\n",
    "\n",
    "sections_df.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Parsing the References Section\n",
    "\n",
    "This cell filters for the References section of the academic paper and splits its content to extract individual citations, enabling further analysis of the bibliography."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 3. Filter for specific section (References) and parse its content\n",
    "references_df = sections_df.filter(\n",
    "    fc.col(\"heading\").contains(\"References\")\n",
    ")\n",
    "\n",
    "# Split references content on [1], [2], etc. patterns to separate individual citations\n",
    "references_df.select(\n",
    "    fc.text.split(fc.col(\"content\"), r\"\\[\\d+\\]\").alias(\"references\")\n",
    ").explode(\"references\").show()\n",
    "print()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Extracting References Using JSON and JQ\n",
    "\n",
    "This cell converts the Markdown document to a JSON structure and uses JQ queries to extract individual references from the References section. \n",
    "\n",
    "This approach enables precise parsing and structuring of citation data for further analysis."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 4. Extract references using JSON + jq approach\n",
    "# Convert the original document to JSON structure\n",
    "document_json_df = df.select(\n",
    "    fc.col(\"paper_title\"),\n",
    "    fc.markdown.to_json(fc.col(\"markdown\")).alias(\"document_json\")\n",
    ")\n",
    "\n",
    "# Extract individual references using pure jq\n",
    "# References are nested under \"7 Conclusion\" -> \"References\" heading\n",
    "individual_refs_df = document_json_df.select(\n",
    "    fc.col(\"paper_title\"),\n",
    "    fc.json.jq(\n",
    "        fc.col(\"document_json\"),\n",
    "        # Navigate to References section and split text into individual citations\n",
    "        '.children[-1].children[] | select(.type == \"heading\" and (.content[0].text == \"References\")) | .children[0].content[0].text | split(\"\\\\n\") | .[]'\n",
    "    ).alias(\"reference_text\")\n",
    ").explode(\"reference_text\").select(\n",
    "    fc.col(\"paper_title\"),\n",
    "    fc.col(\"reference_text\").cast(fc.StringType).alias(\"reference_text\")\n",
    ").filter(\n",
    "    fc.col(\"reference_text\") != \"\"\n",
    ")\n",
    "\n",
    "individual_refs_df.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Extracting Reference Numbers and Content\n",
    "\n",
    "This cell uses a text extraction template to separate reference numbers from citation content in the References section, producing a structured DataFrame of individual citations for further analysis."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Extract reference number and content using text.extract() with template\n",
    "print(\"Extracting reference numbers and content using text.extract():\")\n",
    "parsed_refs_df = individual_refs_df.select(\n",
    "    fc.col(\"paper_title\"),\n",
    "    fc.text.extract(\n",
    "        fc.col(\"reference_text\"),\n",
    "        \"[${ref_number:none}] ${content:none}\"\n",
    "    ).alias(\"parsed_ref\")\n",
    ").select(\n",
    "    fc.col(\"paper_title\"),\n",
    "    fc.col(\"parsed_ref\").get_item(\"ref_number\").alias(\"reference_number\"),\n",
    "    fc.col(\"parsed_ref\").get_item(\"content\").alias(\"citation_content\")\n",
    ")\n",
    "\n",
    "print(\"References with separated numbers and content:\")\n",
    "parsed_refs_df.show()\n",
    "print()\n",
    "\n",
    "# Clean up session resources\n",
    "session.stop()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
